Learning Parameters of the K-Means Algorithm From Subjective Human Annotation
نویسندگان
چکیده
The New York Public Library is participating in the Chronicling America initiative to develop an online searchable database of historically significant newspaper articles. Microfilm copies of the papers are scanned and high resolution OCR software is run on them. The text from the OCR provides a wealth of data and opinion for researchers and historians. However, the categorization of articles provided by the OCR engine is rudimentary and a large number of the articles are labeled “editorial” without further categorization. To provide a more refined grouping of articles, unsupervised machine learning algorithms (such as K-Means) are being investigated. The K-Means algorithm requires tuning of parameters such as the number of clusters and mechanism of seeding to ensure that the search is not prone to being caught in a local minima. We designed a pilot study to observe whether humans are adept at finding sub-categories. The subjective labels provided by humans are used as a guide to compare performance of the automated clustering techniques. In addition, seeds provided by annotators are carefully incorporated into a semi-supervised K-Means algorithm (Seeded K-Means); empirical results indicate that this helps to improve performance and provides an intuitive sub-categorization of the articles labeled “editorial” by
منابع مشابه
Optimizing the Grade Classification Model of Mineralized Zones Using a Learning Method Based on Harmony Search Algorithm
The classification of mineralized areas into different groups based on mineral grade and prospectivity is a practical problem in the area of optimal risk, time, and cost management of exploration projects. The purpose of this paper was to present a new approach for optimizing the grade classification model of an orebody. That is to say, through hybridizing machine learning with a metaheuristic ...
متن کاملA Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS
Data clustering is the process of partitioning a set of data objects into meaning clusters or groups. Due to the vast usage of clustering algorithms in many fields, a lot of research is still going on to find the best and efficient clustering algorithm. K-means is simple and easy to implement, but it suffers from initialization of cluster center and hence trapped in local optimum. In this paper...
متن کاملExplain the theoretical and practical model of automatic facade design intelligence in the process of implementing the rules and regulations of facade design and drawing
Artificial intelligence has been trying for decades to create systems with human capabilities, including human-like learning; Therefore, the purpose of this study is to discover how to use this field in the process of learning facade design, specifically learning the rules and standards and national regulations related to the design of facades of residential buildings by machine with a machine ...
متن کاملScalable Image Annotation by Summarizing Training Samples into Labeled Prototypes
By increasing the number of images, it is essential to provide fast search methods and intelligent filtering of images. To handle images in large datasets, some relevant tags are assigned to each image to for describing its content. Automatic Image Annotation (AIA) aims to automatically assign a group of keywords to an image based on visual content of the image. AIA frameworks have two main sta...
متن کاملPersistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm
Identifying clusters or clustering is an important aspect of data analysis. It is the task of grouping a set of objects in such a way those objects in the same group/cluster are more similar in some sense or another. It is a main task of exploratory data mining, and a common technique for statistical data analysis This paper proposed an improved version of K-Means algorithm, namely Persistent K...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011